Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

Chromatin Immunoprecipitation Sequencing ◾ 215

Library preparation for ChIP-Seq DNA fragments follows the same steps as for whole genome sequencing (WGS): fragmentation, end repair, adaptor ligation, and enrichment. Sequencing itself proceeds as for any other DNA library on the chosen platform, and the resulting raw data consist of millions of ChIP-Seq reads.

The sequencing strategies used for ChIP-Seq are the same as those used for WGS and RNA-Seq. The design of a ChIP-Seq experiment is usually tailored to the condition being studied, and that design guides the subsequent data analysis. The raw data produced by the sequencer are reads stored in FASTQ files. Sequencing can be single end or paired end, with short reads (e.g., Illumina) or long reads (e.g., PacBio). However, most ChIP-Seq datasets have been generated using single-end libraries, and we should be aware that some programs do not support paired-end data. A run can contain a single sample or be multiplexed across several samples; in a multiplexed run, the fragments of each sample are tagged with a unique barcode.
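To make the idea of barcoded, multiplexed reads concrete, the following is a minimal sketch of how reads in a FASTQ file could be grouped by sample. It assumes the barcode is the last colon-separated field of each header line, as in recent Illumina headers; the read names, barcodes, and sample names are hypothetical. In practice, demultiplexing is usually done by the sequencing facility's software before the FASTQ files are delivered.

```python
from collections import defaultdict
from io import StringIO


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (an empty line also stops this simple parser)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def demultiplex(handle, barcode_to_sample):
    """Group reads by the barcode found at the end of each header line.

    Assumption: the header ends with ':<barcode>'; reads whose barcode is
    not in the lookup table are collected under 'undetermined'.
    """
    by_sample = defaultdict(list)
    for header, sequence, quality in read_fastq(handle):
        barcode = header.rsplit(":", 1)[-1]
        sample = barcode_to_sample.get(barcode, "undetermined")
        by_sample[sample].append((header, sequence, quality))
    return dict(by_sample)


# Toy data: three single-end reads carrying two barcodes (all values made up).
toy_fastq = StringIO(
    "@r1 1:N:0:ATCACG\nACGTACGT\n+\nIIIIIIII\n"
    "@r2 1:N:0:CGATGT\nTTGCAAGC\n+\nIIIIIIII\n"
    "@r3 1:N:0:ATCACG\nGGCATTAC\n+\nIIIIIIII\n"
)
samples = {"ATCACG": "chip", "CGATGT": "input_control"}

grouped = demultiplex(toy_fastq, samples)
reads_per_sample = {name: len(reads) for name, reads in grouped.items()}
print(reads_per_sample)  # {'chip': 2, 'input_control': 1}
```

For paired-end data, the same grouping would be applied to both mate files in parallel, so that each pair stays together.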

6.3 CHIP-SEQ ANALYSIS WORKFLOW

In general, the ChIP-Seq analysis workflow includes raw data acquisition, quality control, read alignment, alignment quality control, peak calling, combining peak calls, and final analysis (visualization, motif discovery, annotation, and functional enrichment). You are already familiar with the first four steps, which were discussed in detail in Chapters 1 and 2. ChIP-Seq raw data can either be provided by the sequencing facility for an experiment or be downloaded from a database. In either case, you may need to reprocess the FASTQ files (refer to Chapter 1). Several databases host ChIP-Seq raw data submitted by other researchers, either as supplementary material for their publications or as part of projects dedicated to investigating particular conditions, with the data made public for other researchers. The NCBI SRA is the most commonly used database for these purposes, and it integrates most of the other public databases. FASTQ files are downloaded from the NCBI SRA database using the SRA Toolkit, which we used in the previous chapters. ChIP reads can be subjected to quality control

by following the steps discussed in Chapter 1. The FastQC program is used to assess the quality of the reads in the files. The reads can then be preprocessed, if needed, to remove low-quality reads, trim low-quality ends or adaptor sequences, and remove duplicate reads and other technical reads. As in other sequencing applications, quality control is crucial to avoid misleading interpretation of the results. Read mapping

is performed by aligning ChIP reads and control reads to a reference genome. The same aligners (e.g., BWA and Bowtie) used for aligning WGS reads can also be used for ChIP-Seq data (refer to Chapter 2). The alignment information produced by the aligner is stored in SAM/BAM files. Before proceeding to the next step, we can remove duplicate reads. Duplicate reads are copies derived from a single original fragment, typically through PCR amplification; they are identical and align to the same position, which indicates low library complexity in that region. For ChIP-Seq data, reads align upstream and downstream of the binding sites, leaving the binding sites themselves with low sequence coverage. The read pileup density (also called signal) around the binding sites should form a bimodal enrichment pattern, with Watson (forward) strand tags enriched upstream of the binding site and Crick (reverse) strand tags enriched downstream. The shape